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Abstract: When in a full exponential family the maximum likelihood es- 
1 timate (MLE) does not exist, the MLE may exist in the Barndorff-Niclsen 

completion of the family. We propose a practical algorithm for finding the 

1 I ' MLE in the completion based on repeated linear programming using the 

i R contributed package redd and illustrate it with two generalized linear 

• model examples. When the MLE for the null hypothesis lies in the comple- 

tion, likelihood ratio tests of model comparison are almost unchanged from 
the usual case. Only the degrees of freedom need to be adjusted. When the 
^ ' MLE lies in the completion, confidence intervals are changed much more 

S from the usual case. The MLE of the natural parameter can be thought of 
as having gone to infinity in a certain direction, which we call a generic di- 
rection of recession. We propose a new one-sided confidence interval which 
says how close to infinity the natural parameter may be. This maps to 
,— ' one-sided confidence intervals for mean values showing how close to the 

^ ' boundary of their support they may be. 
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1. Introduction 

The problem addressed in this article is widespread. Many users of statistics 
have run into it, although they may not have been aware of it. The problem 
has been well understood for thirty years, but until now convenient software to 
handle the problem has not been available. 

In a discrete exponential family, for example, logistic regression or categorical 
data analysis, it can happen that the maximum likelihood estimate (MLE) does 
not exist in the conventional sense. This is often not detected by software, so 
users may be unaware of the situation. Whether they are aware or not, available 
software provides no support for valid statistical inference in this situation. The 
aster contributed package for R (6) has a check for near-singularity of the 
Fisher information matrix, which indicates either nonexistence of the MLE or 
near collincarity of predictors, and this check has revealed a number of instances 
where nonexistence of the MLE arose in actual applications. 

When the MLE does not exist in the conventional sense, it may exist in 
the Barndorff-Nielsen completion of the family (2; 3; 5, and references cited 
at the beginning of Chapter 2 of the latter) . A practical algorithm for finding 
the MLE in the Barndorff-Nielsen completion of an exponential family using 
repeated linear programming was proposed in the author's unpublished thesis 
(5) and used in (10). Now we propose a different method, also using repeated 
linear programming with the R contributed package redd (9). We also propose 
methods for hypothesis tests and confidence intervals valid when the MLE docs 
not exist in the conventional sense. Our methods only work for full exponential 
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families but arc simpler than the methods of (5) which worked for non-full 
convex families. 

Our examples are chosen to be good examples for teaching statisticians how 
to use our methods. All the analysis discussed in this article is carried out in 
full in the accompanying technical report (7), which is produced using the R 
function Sweave so all results in the report are actually produced by the code 
shown therein and hence are fully reproducible by anyone who has R. 

2. Examples 

2.1. A Binomial Example 

The binomial distribution provides the simplest example. Suppose x is bino- 
mial with sample size n and success probability p. The MLE for p is p = x/n. 
So far, so good, but three things are different about the cases p = and p = l. 
First, the natural parameter is 9 = logit(p), and 9 = logit(p) does not exist when 
p = or p = 1. Second, the probability distribution corresponding to the MLE 
is degenerate, the binomial distribution with success probability p = or p = 1 
being concentrated at x = and x = n, respectively. Third, the elementary 95% 
confidence interval p ± 1.96-y/p(l — p)/n does not work when p = or p = 1. 

2.2. A Logistic Regression Example 

The same kind of problems arise in multiparameter exponential families but 
the relevant multidimensional geometry is much harder to visualize. To intro- 
duce ideas, consider what is still a relatively simple example, which is a logistic 
regression. Suppose we observe a vector y whose components arc Bernoulli with 
means forming a vector p. The natural parameter is 9 — logit(p), where logit op- 
erates componentwise 9i = logit (p,). Suppose we also have one covariate vector 
x and we want to fit a quadratic model 



Finally, suppose Xi takes the values 1, . . ., 30 and Ui = for Xi < 12 or Xi > 24 
and yi — 1 otherwise. 

If we try to fit these data using the R function glm, it complains "algorithm 
did not converge, fitted probabilities numerically or 1 occurred." In fact, the 
MLE for (3 does not exist. Define the vector 



then we say i5 is a generic direction of recession (GDOR) because there exists a 



9,.= (3!+ f3 2 Xi + fcxl 



5 = (-587,72,-2) 



(1) 



$ such that 



lim l(/3 + s5) 



sup l(/3), 



(2) 



s 
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where / is the log likelihood function. Although we write ft here, this does not 
denote the MLE — the MLE does not exist — we can consider the MLE for 
ft to be ft sent to infinity in the direction 8. Neither ft nor 6 satisfying (2) are 
unique. In this example, (2) actually holds for all ft G M 3 when S is given by (1). 

If we consider what happens to the mean value parameter vector p, we find 
that p(ft + sS) — ► y as s — > oo. Thus MLE for the mean value parameter vector 
does exist and is equal to the observed data vector. This MLE is strange because 
the distribution it goes with is degenerate, concentrated at the observed data. 
The MLE distribution says we can never observe data other than what we did 
observe; other data values occur with probability zero. 

This degeneracy need not cause problems for statistical inference. The sample 
is not the population, and estimates are not parameters. What we need is a 
confidence interval, necessarily one-sided, that says how close ft is to infinity 
and how close p is to y. 

Fix a choice of ft, say ft = 0, and consider for all real s the probability 
distribution having parameter ft + sS. As s goes from — oo to +oo the probability 
of seeing the data value actually observed goes from zero to one. Find the unique 
s that makes this probability 0.05, call it s, then [s, oo) is a 95% confidence 
interval for the scalar parameter s. Figure 1 shows the mean value parameter- 
values corresponding to the ends of this confidence interval (s and oo). These 
intervals are not the best we can do; Figure 2 shows improved intervals that 
require more computation. 

We summarize this example, explaining which features are general and which 
are not. First, the GDOR notion is perfectly general. When the MLE for the 
natural parameter does not exist, then under conditions of Brown (3) that hold 
in all practical examples, there is a GDOR, and the likelihood is maximized by 
going to infinity in that direction. Second, the MLE for the mean value parame- 
ter for this example corresponds to a completely degenerate distribution, which 
is not general. Usually it will be only partially degenerate, some components 
of the response being fixed at an extreme value but other components being 
random under the MLE distribution. Third, the MLE for the natural parameter 
"is" any ft plus infinity in the direction of the GDOR, which is not general. 
Usually only some ft will work in (2). Fourth, Figure 1 shows most of what is 
important about this example, but in general there is no such figure. Usually, 
hypothesis tests and confidence intervals are all that can be reported. 

2.3. A Contingency Table Example 

This example isa2x2x---x2 contingency table with seven dimensions hence 
2 7 = 128 cells. The file http://www.stat.umn.edu/geyer/gdor/catrec.txt 
presents the data as eight vectors, seven categorical predictors v\, . . ., V7 that 
specify the cells of the contingency table and one response y that gives the cell 
counts. 

To simplify the discussion, we take the cell counts to be independent Poisson 
random variables so we have a generalized linear model. The R statement 
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Fig 1 . Quick and dirty confidence intervals for the mean value parameter. Outer points having 
values zero or one are MLE for the mean value parameter, which are also the components of 
the observed data vector. Inner points are mean values corresponding to (3 + §8, where s is 
the lower bound for a 95% one-sided confidence interval for s and where = and 8 is given 
by (1). Compare Figure 2. 
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Table 1 

Generic direction of recession for contingency table example. Only nonzero components 

shown. 

coefficient direction 

intercept — 1 

Vi 1 

V2 1 

Vs 1 

V5 1 

v 1 : v 2 -1 

Vl : U3 —1 

Vi : vs -1 

v 2 : V3 -1 
V2 : V5 —1 
V3 :vs -1 

vi : V2 : V3 1 
1>1 : i>3 : i) 5 1 
i>2 : i>3 : ^5 1 



out3 <- glm(y ~ (vl + v2 + v3 + v4 + v5 + v6 + v7)~3, 
family = poisson, data = dat , x = TRUE) 

fits the model with all three-way interactions but no higher-order interactions, 
assuming the data have been read in as data frame dat. 

R does not complain in fitting this model, even though it should. The correct 
MLE is "at infinity." 

This model has 64 parameters (for a table with 128 cells). One might say this 
is too many parameters chasing too little data, but a test comparing this model 
to the model with only two-way interactions says the three-way model fits the 
data much better (P ps 10~ 17 ). Whether one likes this model or not, it should 
be possible for statisticians to analyze it, and we will use it for an example. 

With 64 parameters we do not show the whole GDOR, only its nonzero com- 
ponents, which arc shown in Table 1. It is hard to visualize as a 64-dimensional 
vector. Much easier to understand is what it does to the mean cell counts. 
Sixteen cells have mean zero at the MLE in the Barndorff-Nielsen completion. 
They are shown in Table 2. One other cell has observed data value zero but 
MLE mean value nonzero. 

The next step in the analysis is to find a (3 such that (2) is satisfied. It has 
long been known that such a j3 is determined by finding the MLE in the family 
conditioned on certain cells being zero, in this case those in Table 2. This is 
easily accomplished by removing those rows from the data. The following R 
statements fit this model 

dat.cond <- dat [linear, ] 

out3.cond <- glm(y " (vl + v2 + v3 + v4 + v5 + v6 + v7)~3, 
family = poisson, data = dat.cond) 

where linear indicates the cells that have nonzero MLE mean values and dat 
is the data frame containing the original data. We do not show the results 
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Table 2 

Cells with MLE mean value zero and 95% confidence intervals (lower and upper are lower 

and upper confidence bounds). 
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^6 


v 7 


lower 


upper 


























0.2863 











1 
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0.1410 
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0.3239 



of this fit — it is shown in the accompanying technical report (7) — partly 
because there are 64 regression coefficients, but mainly because there is nothing 
new in this fit or its use to make inference about the cell mean values or the 
corresponding linear predictor values for the cells of the table involved (those 
with linear equal to TRUE). The R function predict. glm will produce valid 
confidence intervals for these cells using this fit. 

We only remark that (3 is a little strange in that it has one nonestimablc 
coefficient (which R indicates as being NA) because the model matrix is not full 
rank. The original model matrix was full rank, but when we remove sixteen of its 
rows the result is not full rank. R is smart enough to drop one or more columns 
(in this example just one) of the model matrix producing a new model matrix 
that is full rank and has the same column space as the one it was given, which 
is equivalent to setting the coefficients corresponding to the dropped columns 
to zero. Although R reports these coefficients as NA, we must take them to be 
zero. 

2.3.1. Confidence Intervals 

To obtain one-sided confidence intervals we use the same procedure used in 
the other example. Consider for all real s the probability distribution having 
parameter (3 + sS. As s goes from -co to +oo the probability of observing data 
having zeros in the 16 cells listed in Table 2 goes from zero to one. Find the 
unique s that makes this probability 0.05, call it s, then [s, oo) is a 95% confi- 
dence interval for the scalar parameter s. Calculate the mean value parameter 
vector corresponding to (3 + s8. This gives the upper bounds for 95% confidence 
intervals shown in Table 2. 
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2.3.2. Hypothesis Tests 

When the MLE does not exist in the conventional sense for the null hypothesis 
of a likelihood ratio test, the usual asymptotics do not hold and the usual 
test is invalid. However, the usual asymptotics may apply when the test is 
done conditionally, the conditioning event being the same as the one used in 
determining 0. In this example, if the null hypothesis is all three-way but no 
higher- way interactions, one merely proceeds using the data frame data.cond 
produced above to fit both the null and alternative models and do the likelihood 
ratio test. This idea was suggested by S. Fienberg (personal communication). 

Of course, one does not have to believe the asymptotics on which this test 
is based. One can instead calculate a P-value based on a parametric bootstrap, 
using as the null distribution the one fitted in out3 . cond. Since this procedure 
has nothing to do with the concerns of this article, being the same thing one 
would do whenever one distrusts asymptotics, we will say no more about it. 

3. Theory 

We redevelop the theory of Barndorff-Nielsen completion of exponential fam- 
ilies (2, Sections 9.3 and 9.4; 3, pp. 191-202; 5, Chapters 2 and 4) so that it is 
useful for calculation, particularly calculation using the R contributed package 
redd. 

3.1. Exponential Families 

An exponential family of distributions (2; 3; 5) is a statistical model having 
log likelihood 

l(6) = (y,9)-c(6), (3) 

where y = (y±, . . . , y p ) is a vector statistic, 9 = {6\, . . . , 9 P ) is a vector parameter, 
and 

i=l 

A statistic y and parameter 9 that give a log likelihood of this form arc called 
natural. The function c is called the cumulant function of the family. 

The distribution with parameter value 9 has a density with respect to the 
distribution with parameter value ip of the form 

fe{uj) = eWoO.^-cW+cWO, ( 4 ) 

The requirement that this integrate to one determines the function c up to an 
additive constant 

c(60 = c(V0+log^(e< r > e -^>). (5) 
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We take (5) to be valid for all 9 in MP, defining c(9) = oo for 9 such that the 
expectation in (5) is infinite. Define 

6 = {9 G M p : c(9) < oo}. (6) 

The exponential family is full if its natural parameter space is (6). We shall be 
interested only in full families. 

By convention, the cumulant function and log likelihood arc defined for all 
9 G W not just at valid parameter values. We have c(9) = oo and 1(9) = — oo 
for 9 0, so such 9 can never maximize the likelihood. 

3.2. Tangent Cone, Normal Cone, and Convex Support 

The tangent cone of a convex set C at a point y G C is 

Tc{y) = cl{ s(w — y):w&C and s > }, (7) 

where cl denotes the closure operation (14, Theorem 6.9). The normal cone of 
a convex set C in W at a point y G C is 

Nc{y) = {5 <ER P : {w — y, 6) < for all w G C }. (8) 

(14, Theorem 6.9). Tangent and normal cones are polars of each other (14, 
Theorem 6.9 and Corollary 6.30). Each determines the other. 

The convex support of an exponential family is the smallest closed convex set 
that contains the natural statistic with probability one under some distribution 
in the family, in which case this is true for all distributions in the family, because 
the distributions are mutually absolutely continuous (2, pp. 90 and 111-112). 

3. 3. Directions of Recession and Constancy 

Directions of recession and constancy of convex and concave functions are 
defined by Rockcfellar (13, p. 69). We apply these notions to log likelihoods of 
full exponential families. Proofs of all theorems and corollaries arc given in the 
appendix. 

Theorem 1. For some vector 6 and for a full exponential family with log like- 
lihood (3), natural parameter space ; convex support C, natural statistic Y , 
and observed value of the natural statistic y such that y G C ' , the following are 
equivalent. 

(a) There exists a 9 G such that s t— * 1(9 + sS) is not a strictly concave 
function on the interval where it is finite. 

(b) For all 9 G the function s i— > 1(9 + sS) is constant on K. 

(c) The parameter values 9 and 9 + sS correspond to the same probability 
distribution for some 9 G and some s =/= 0. 

(d) The parameter values 9 and 9 + sS correspond to the same probability 
distribution for all 9 G and all real s. 
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(c) (Y — y, 6) = almost surely for some distribution in the family. 

(f) (Y — y, S) = almost surely for all distributions in the family. 

(g) SeN c (y) and -5eN c (y). 

(h) (w,S) =0, for allw £T c {y). 

Any vector 6 that satisfies any one of the conditions of the theorem (and hence 
all of them) is called a direction of constancy of the log likelihood. The set of 
all directions of constancy is called the constancy space of the log likelihood. 
It is clear from (e) or (h) of the theorem that the constancy space is a vector 
subspace. 

Corollary 2. For a full exponential family, suppose 9\ and O2 are maximum 
likelihood estimates. Then 9\ — 62 is a direction of constancy. 

From the corollary and (d) of the theorem, we see that directions of constancy 
do not cause any problem for statistical inference, because all maximum likeli- 
hood estimates correspond to the same probability distribution. Thus we have 
uniqueness where it is important. Nonuniqucncss of the MLE for the natural 
parameter is, at worst, merely a computational nuisance. 

A family is said to be minimal if it has no directions of constancy. This can 
always be arranged by reparametrization (2, pp. 111-116; 3, pp. 13-16; see also 
5, Section 1.5). The R function glm always uses a minimal parametrization, 
dropping predictors to obtain a full rank model matrix. However, insisting on 
minimality can complicate theoretical issues. Better to keep options open and 
allow for directions of constancy. 

Theorem 3. For some vector 5 and for a full exponential family with log like- 
lihood (3), natural parameter space 0, convex support C , natural statistic Y , 
and observed value of the natural statistic y such that y £ C ' , the following are 
equivalent. 

(a) There exists a 9 £ such that the function s <— » 1(9 + sS) is nondecreasing 
on E. 

(b) For all 9 € the function s 1— > 1(9 + sS) is nondecreasing on R. 

(c) (Y — y, S) < almost surely for some distribution in the family. 

(d) (Y — y, 5) < almost surely for all distributions in the family. 
(c) 6 6 N c (y). 

(f) (w, 5) < 0, for all w G T c (y). 

Any vector 8 that satisfies any one of the conditions of the theorem (and hence 
all of them) is called a direction of recession of the log likelihood. From now 
on we will simply say direction of recession or constancy to refer to directions 
of recession or constancy of the log likelihood. Note that every direction of 
constancy is a direction of recession. 

Theorem 4. For a full exponential family with convex support C and observed 
value of the natural statistic y such that y 6 C ' , the following are equivalent. 

(a) The MLE exists. 

(b) Every direction of recession is a direction of constancy. 
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(c) Nc(y) is a vector subspace. 

(d) Tc{y) is a vector subspace. 

This theorem provides a complete geometric solution to the problem of when 
the MLE exists in a full exponential family. It is the basis of the complete 
computational solution given in Section 3.12. 

Corollary 5. For a full exponential family with log likelihood (3), natural pa- 
rameter space O, convex support C , and observed value of the natural statistic 
y such that y G C , if 8 is a direction of recession that is not a direction of 
constancy, then for all 9 G the function s <— > 1(0 + sS) is strictly increasing on 
the interval where it is finite. 

3-4- Limits in Directions of Recession 

Theorem 6. For a full exponential family having log likelihood (3), densities 
(4), natural statistic Y, observed value of the natural statistic y such that y 
is in the convex support, and natural parameter space O, if S is a direction of 
recession, 

H = {w el p : (w-y, S) = }, (9) 

and pr(y G H) > for some distribution in the family, and hence for all, then 
for all 6 G 6 

fo, (F(w)-y,5)<0 
lim f e +s S (u) = lfe(u})/pr e (Y €H), (Y(u)-y,6)=0 (10) 

s — >-oo 

[+oo, (Y(u;)-y,6)>Q 

If S is not a direction of constancy, then s <—y pr^,^ (V G H) is continuous and 
strictly increasing, and pr g +s g(Y G H) — ► 1 as s — > oo. 

We note three things about the right-hand side of (10). First, it is a probabil- 
ity density with respect to the distribution having parameter value ijj. The set 
where it is +00 has probability zero by Theorem 3 (d), so this is not a problem. 
Second, it is the density of the conditional distribution given the event Y G H 
of the distribution having parameter value 9. Third, by Scheffc's lemma (12, 
p. 351) pointwise convergence of densities implies convergence in total varia- 
tion, which implies convergence in distribution. Denote the right-hand side of 
(10) by f e (iu I Y G H). 

It is clear that the family 

{fe(- \Y EH):9ee} (11) 

is an exponential family with the same natural statistic and natural parameter 
as the original family. It need not be full. The natural parameter space of the 
full family containing it is at least as large as 

O + r Hm = { 9 + 7 : 9 G 9 and 7 G T lim }, (12) 
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where O is the natural parameter space of the original family and ru m is the 
constancy space of the family (11). We will assume that (12) is the natural 
parameter space of the full family containing (11), and we will call this full 
family the limiting conditional model (LCM). 
It is clear that the log likelihood for (11) 

l H (d) = (y,6) - c(9) -logpr e (Y e H) 

satisfies 

l(0)<l H {9), 6 9. 

Thus, if an MLE exists for the LCM, then it maximizes the likelihood in the 
family that is the union of the LCM and the original family. When this happens, 
we say we have found an MLE in the Barndorff-Niclsen completion of the original 
family. 

3.5. Convex Polyhedra 

A set is a convex polyhedron if it is the intersection of a finite collection of 
closed half-spaces or, alternatively, if it is the convex hull of a finite set of points 
and directions (13, Chapter 19). The equivalence of these two characterizations is 
called the Minkowski- Weyl theorem (13, Theorem 19.1). The R function sedd in 
the contributed package redd (9) converts between these two representations of 
a convex polyhedron, which it calls the H-representation and V-representation, 
respectively. 

We will need more details of V-representations. They characterize a convex 
polyhedron as the set of all linear combinations 

} J bicti, (13) 

i£EUI 

where the on are vectors, the bi are scalars, E and / are disjoint finite sets, and 
the bi satisfy 

bi > 0, ieEUl (14a) 

and if / is nonempty 

5> = 1. (14b) 

iei 

The sets P = {oi : i E / } and D = {ai : i E E} are the called points and 
directions, respectively, constituting the V-representation. When P is empty, we 
write 

C = con(pos D) 

to denote the relationship between D and the convex polyhedron C it charac- 
terizes. This notation follows Rockcfellar and Wets (14, Sections 2E and 3G). 

When C is a convex polyhedron, Nc(y) and Tc{y) are also convex polyhedra 
for each y 6 C and arc given in terms of the H-representation of C by simple 
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formulas (14, Theorem 6.46). Moreover, the closure operation in (7) is unneces- 
sary when C is a convex polyhedron. If a convex cone, such as Tc{y) or Nc{y), 
is polyhedral, then its V-representation can consist of directions only, so can be 
of the form con(pos V) for some finite set V. 

For all exponential families we deal with, we assume that the convex support 
is polyhedral. 

3. 6. Generic Directions of Recession 

The relative interior of a convex set C, denoted rint C, is its interior relative 
to its affine hull (13, Chapter 6). We say a vector S is a generic direction of 
recession (GDOR) if S € rintiVc(y) and Nc(y) is not a vector subspace, where 
C is the convex support and y an observed value of the natural statistic such 
that y £ C. Since the relative interior is always nonempty (13, Theorem 6.2), a 
GDOR exists if and only if none of the conditions of Theorem 4 hold. 

Theorem 7. For a full exponential family having polyhedral convex support 
C and observed value of the natural statistic y such that y G C, let Tc(y) = 
con(posV / ) ; and define 



Then a generic direction of recession exists if and only if L ^ V , in which case 
a vector 5 is a generic direction of recession if and only if 



Corollary 8. Under the assumptions of the theorem, a generic direction of 
recession is not a direction of constancy. 

If B is a set of vectors, let spani? denote the smallest vector subspace con- 
taining B. Also for any vector x, let x + span_B = { x + v : v £ spani? }. 

Corollary 9. Under the assumptions of the theorem, suppose S is a generic 
direction of recession, and H is defined by (9). Then Tcnii{y) = spanL, and 
CCiH = Cn (y + spanL). 

The theorem and corollaries explain the purpose of generic directions of re- 
cession. By Corollary 8, a GDOR implies the MLE does not exist in the conven- 
tional sense, so we seek it in the LCM described in Section 3.4 using the GDOR 
as the 5 in Theorem 6. Suppose that C H H is the convex support of the LCM. 
Then TcnH (y) being a vector subspace implies that the MLE in the LCM exists 
by Theorem 4 (c). Thus finding one GDOR allows us to find the MLE in the 
Barndorff-Niclscn completion. 



L = {veV:~veT c (y)}. 



(w,S) = 0, 
(w,S) < 0, 



w £ V\L 



w e L 



(15a) 
(15b) 
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3.7. Assumptions 

We summarize the assumptions we have made above. We deal with a full ex- 
ponential family having convex support C that is a polyhedron. If every direction 
of recession is a direction of constancy, then we need no further assumptions. 
Otherwise, let S be a generic direction of recession, let C be the convex support, 
let Y be the natural statistic, let y be an observed value of the natural statistic 
satisfying y s C, and let H be defined by (9). We assume the event Y £ H has 
positive probability so the LCM defined in Section 3.4 exists. We further assume 
that C fl H is the convex support of this LCM, so that by Corollary 9 the MLE 
in this LCM exists. We further assume that the natural parameter space of this 
LCM is given by (12), so that confidence intervals (Section 3.16 below) work. 

3.8. Comparison with Pre-Existing Theory 

The pre-existing theory of the Barndorff-Nielsen completion (2; 3; 5) says the 
MLE lies in the LCM whose convex support, what we are calling C n H, is the 
unique face of C containing y in its relative interior (5, Chapter 4 generalizes 
this). Thus the pre-existing approach makes it clear that the LCM containing 
the MLE is unique and does not depend on the GDOR, which is in general 
not unique. In our approach, uniqueness comes from the assertion C fl H = 
Cn(y+spanL) in Corollary 9. This makes it clear that, although the hyperplane 
H does depend on the GDOR 8 used to define it, the convex support C fl H of 
the LCM does not depend on H, hence does not depend on 6. 

3.9. Natural Affine Submodels 

In most applications of exponential family theory, we start with a very large 
exponential family, which we call saturated and which has too many parameters 
to estimate well. Then we consider natural affine submodels, parametrized by 

6 = a + M(3, 

where 9 is the natural parameter of the saturated model, (3 is the natural pa- 
rameter of the natural affine submodel, a is a known vector, and M is a known 
matrix. In the terminology of the R function glm, a is called the offset vector 
and M is called the model matrix. 
Because 

(y,a + MP} = (y,a} + (M T y,P), 

where the two bilinear forms on the right-hand side have different dimensions, 
and the first term on the right-hand side does not contain the parameter and 
can be dropped from the log likelihood, the submodel is itself an exponential 
family with natural statistic M T y and natural parameter j3. Thus everything 
said above applies to natural affine submodels, we just work with the convex 
support of M T Y rather than of Y. 
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3. 1 0. Tangent Cones of Models and Affine Submodels 

Let C sat denote the convex support of the saturated model and C su b that of 
the natural affine submodel. By Theorems 6.43 and 6.46 in (14), 

Zb. ub (M T y) = cl{ M T w : w € T Caat (y) } (16) 

and the closure operation is not necessary if C sa t is polyhedral. Moreover, it is 
clear that if Tc 3at (y) = con(pos V^t), then T Csuh {y) = con(pos V suh ), where 

V suh = {M T w:weV sa . t }. (17) 

3.11. Tangent Cones of Saturated Models 

If we assume nothing about C sa t except that it is polyhedral, then computa- 
tion of a V- representation V sa ,t for it can be arbitrarily difficult. Thus we shall 
say nothing about the general case and focus on cases of practical interest. 
For generalized linear models, the convex support of the saturated model is a 
Cartesian product. For logistic regression each component of the response vector 
is Bernoulli and C sa t = [0, l] p . For Poisson regression, each component of the 
response vector is Poisson and C sa t = [0, oo) p . 

When C sa t is a Cartesian product, Ic aat (y) can be calculated coordinatewise 
(14, Proposition 6.41). Let et denote the unit vector in the z-th coordinate 
direction (every coordinate is zero except for the i-th, which is one). Then 
is a tangent vector at y if yi is not at the upper bound of its range, and — e$ 
is a tangent vector at y if yi is not at the lower bound. Hence when C sa t is a 
Cartesian product, its V-representation V sat contains or — or both for each 
i. 

In log-linear models for categorical data analysis, if Poisson response is as- 
sumed, then C sa t = [0, oo) p , just as in Poisson regression. If multinomial or 
product-multinomial response is assumed, then C sa t is not a Cartesian product, 
but, as we shall see (Section 3.17 below), the MLE and GDOR are the same re- 
gardless of the assumed response distribution. Hence the solution for the Poisson 
response case can be used for the other cases. 

Thus this analysis of the Cartesian product case suffices for the vast majority 
of applications, and it suffices for our examples. An application where the convex 
support of the saturated model is not a Cartesian product and this analysis 
would not suffice is unconditional aster models (11). 

3.12. Calculating the Linearity 

Next we determine the linearity of V su b 

L S nh = { w e Kmb : -w E con(pos Fsub) }• (18) 

This sounds like a complicated operation, and it is, but the redd package has a 
function linearity that does it by repeated linear programming. 
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Having found the linearity, we have a complete computational solution to the 
problem of when the MLE exists in a full exponential family with polyhedral 
convex support. It exists if and only if L su b = V^ u b- 

All functions in the redd package use two forms of arithmetic. One is the 
default computer arithmetic used by all other R functions. Answers produced 
using that arithmetic are inexact, so one is uncertain whether the L S ub produced 
is actually correct. The other form of arithmetic is exact, infinite-precision, 
rational arithmetic. Answers produced using that arithmetic are exact, so one is 
certain that the L S ub produced is actually correct, but only if the vectors in V su b 
are also produced exactly using either integer arithmetic or rational arithmetic. 

According to comments in the source code (starting at line 3062 of the file 
cddlp.c of the source code for the redd package (9), which comes from the 
eddlib library (4), for each w 6 V sn h the R function linearity solves the 
linear programming problem 

maximize 

(w,6) 

, • (19) 
subject to 

(v, 6)>0, v G V suh \ {w} 

where S is the state vector of the linear programming problem. 

Theorem 10. A vector w is in the linearity (18) if and only if the optimal 
value of the linear program (19) is nonpositive. 



3. 1 3. Calculating Generic Directions of Recession 

If I/ S ub 7^ Kjub, then Tc Bub (M T y) is not a vector subspace, hence there exists 
a generic direction of recession. By Theorem 7, 6 is a GDOR if and only if 

(w, S) = 0, we L suh (20a) 
(w, 5)<0, we V Buh \ L suh (20b) 

Hence we can find one such S by solving the following linear program 

maximize 
c 

subject to 

e < 1 

(v, S) = 0, v e L suh 

(v, S) <-e, v e V sxlh \ i sub 

where 5 is a (/-dimensional vector, e is a scalar, and (5, e) is the state vector of 
the linear program (so the dimension is q + 1). The S part of the solution is a 
generic direction of recession. The e part does not matter. 
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The idea for using this particular linear program came from the documenta- 
tion for the dd_ExistsRestrictedFace2 function in the eddlib library (written 
by K. Fukuda). The redd package does not provide an interface to this eddlib 
function, but it does provide a function lpcdd that does linear programming 
and can be used to solve this linear program. 

3.14- Calculating Maximum Likelihood Estimates 

3.14-1- In the Original Family 

There is little to be said about calculating the MLE in the original family. 
When we have found that L su b = V su h, then we know the MLE exists and can 
use available software to find it. We will use the R function glm for our examples. 

There is one issue worth mentioning. If the model is non-identifiable, so the 
MLE is non-unique, the R function glm is smart enough to drop enough predic- 
tors to produce an identifiable model. However, its method of doing so is not 
guaranteed because of inexactness of the default computer arithmetic. 

We can use the redundant function in the redd package applied to the 
columns of the model matrix M to reduce to a linearly independent subset. 
If M was calculated using integer or rational arithmetic, and redundant uses 
rational arithmetic, then this operation will be exact. 

3.14-2. In the Completion 

When we have found that L su b ^ V su b and have found a generic direction 
of recession 6, we still need to characterize the support of the LCM. We can 
characterize this two ways using either of 

ffsub = { w G R q : (w - M T y, S)=0} 
H sat ={weW : (w-y, AI5) = } 

where p and q are the dimensions of the saturated model and affine submodel, 
respectively. Then the LCM conditions on the event M T Y G ff S ub or Y G H sa ,t, 
which is the same event characterized two different ways, the latter usually 
simpler. 

Theorem 11. In the setup of Sections 3.12 through 3.14, define 

L sa t = { v G V sat : M T v G L suh }. 

Then the support of the limiting conditional model is 

Csat n i? sat = C sat n (y + spanL sat ). (21) 

In the case where the convex support of the saturated model is a Cartesian 
product, the support of the LCM simply constrains Yi = for i such that 
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Gj ^ L aa t, that is, the i-th component of the response is unconstrained if a G L sat 
and is constrained to be equal to its observed value if e, ^ L sat . 

Our analysis of maximum likelihood estimation in the Barndorff-Nielsen com- 
pletion is now finished. At least in the Cartesian product case, the maximum 
likelihood problem in the completion is of the same form as the original prob- 
lem. The only difference is that we constrain certain components of the response 
vector to their observed values. This can be achieved by removing those com- 
ponents from the response vector and proceeding as if the resulting subvector 
were the entire response vector. If, for example, we are using the R function 
glm to fit models, we merely delete certain elements of the response vector and 
the corresponding rows of the model matrix (or the data frame containing the 
data if we are using a formula to specify the model) and proceed normally. This 
LCM produced by deleting some components of the response will always be 
non-identifiable, because S will always be a direction of constancy for the LCM 
and there may be other directions of constancy. The glm function, however, can 
deal with this issue. Furthermore, even in the rare case when the glm function 
may be confused, we can find a full rank model matrix having the same column 
space as the original model matrix using the function redundant in the redd 
package, as described in the preceding section. 

3.15. Likelihood Ratio Tests 

Given two nested natural affine submodels, the maximum value of the log 
likelihood can be calculated for each submodel by available software, such as 
the R function glm, which goes uphill on the log likelihood until reaching a 
point where the log likelihood is nearly flat, in which case the value of the log 
likelihood is nearly the maximum. If the MLE docs not exist in the conventional 
sense, then the natural parameter estimates will be large but not infinite, and 
the glm function may or may not give a warning about lack of convergence. If 
the MLE does not exist in the conventional sense, then the natural parameter 
estimates are infinitely wrong, but the value of the maximized log likelihood is 
nearly correct. Thus we can correctly calculate the likelihood ratio test statistic 
without using any of the theory developed here. 

This docs no good, however, because the usual asymptotics of the likelihood 
ratio test (Wilks' theorem) do not hold in the case where the MLE for the 
null model does not exist in the conventional sense. In this case, the following 
simple correction, suggested by S. Ficnberg (personal communication) seems 
reasonable. Apply the theory developed here to the null model, determining 
L s& t 7^ Ksat- Let Mo and Afi be the model matrices for the null and alternative 
natural affine submodels. 

If we apply Wilks' theorem to the LCM for the null hypothesis, we obtain 
the result that the deviance (twice the log likelihood ratio) is approximately 
chi-squared with degrees of freedom which is the difference in dimension of 
M^spanLgat) and of M^span L sa t). Assuming that Mo, Mi, and L sa t were 
determined exactly using rational arithmetic, the degrees of freedom can be 
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determined exactly by applying the redundant function in the redd package to 
the sets { Mfw : w G L sat }, i = 0, 1. 

This asymptotic approximation may or may not hold depending on the sam- 
ple size and on how close the observed value of the natural statistic is to the 
boundary of the convex support of the LCM. By construction, it cannot be on 
the boundary, but if it is close the asymptotic approximation can be bad. If one 
is worried about the validity of the asymptotic approximation, one can always 
do a parametric bootstrap calculation based on the LCM for the null hypothesis. 

3.16. Confidence Intervals 

Confidence intervals are more complicated than hypothesis tests. Confidence 
intervals for both natural and mean value parameters are of interest. The R 
function predict .glm provides either, depending on the value of its type argu- 
ment. We will also provide either. 

Before getting into details, we should first note that confidence intervals are 
often inappropriate from a theoretical point of view. When there is not a single 
scalar parameter of interest, a confidence region for the vector parameter of in- 
terest should, theoretically, be provided. However, high-dimensional confidence 
regions are unvisualizable and of no interest to users. Thus statisticians usually 
provide something users think they can interpret, which is multiple confidence 
intervals, often not adjusted for simultaneous coverage. That is what, for ex- 
ample, predict . glm provides. We will follow the usual practice, providing such 
confidence intervals in our examples. The accompanying technical report (7) 
discusses confidence regions. 

In the case where no GDOR exists and the MLE exists in the conventional 
sense, we say the original family is the LCM, which it is when we take 6 — 
in Theorem 6. Then there is always an LCM. We calculate confidence intervals 
conditional on the LCM. If we achieve approximately the desired confidence 
level conditionally, then we also achieve it unconditionally. 

3.16.1. In the Limiting Conditional Model 

Our strategy is, like it was with hypothesis tests, to use standard methods 
applied to the LCM as much as possible. Unlike it was with hypothesis tests, 
however, no simple modification of standard methods tells us all we want to 
know. When we fit the LCM, available software gives us confidence intervals for 
its regression coefficients, for components of its linear predictor vector, and for 
its mean value parameters. Its mean value parameter vector is restricted to its 
convex support C n H. No procedure based on the LCM can indicate how close 
the unknown true distribution of the data is to being concentrated on C n H. 
To accomplish that, we need to refer to the original model. 
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3.16.2. In the Original Model 

Let 9 be an MLE in the LCM and let Li; m be the constancy space of the LCM, 
so that 6>+7 is also an MLE in the LCM for 7 G ri im . Let S be a GDOR. We know 
from Theorem 6 that, regardless of the 7 chosen, distributions in the original 
model having parameter values 9 + 7 + sS converge to the MLE distribution in 
the LCM as s — > 00. We seek an answer to the question how close s must be to 
infinity. Thus we seek one-sided intervals for the scalar parameter s of the form 
(s 7 ,oo). 

The conventional conservative P-value for the upper-tailed test having test 
statistic (Y, 5) , null hypothesis 9 + 7 + sS, and observed data y is 

We +1+sS ((Y-y,6)>0). (22) 

A conventional level a test rejects when (22) is less than or equal to a. A 
conservative 1 — a confidence interval consists of the set of s values that are not 
rejected at level a. In the only case of interest to us, where (y,8) is the largest 
possible value of (Y, 6} so the event (Y — y, S) > is the same as Y G H up to 
a set of measure zero, 

J 7 = inf{ set J + 7 + si5e9 and pr§ +1+sS (Y G H) > a}. 

Since s <— ► +sS (Y G H) is continuous and strictly increasing by Theorem 6, 
usually s 7 is the unique s such that prg +7+S(5 (y G H) = a. The accompanying 
technical report (7) explains how these intervals can be made exact using the 
method of fuzzy confidence intervals (8). 

Since all of the intervals (s 7 ,oo) for different 7 are based on the same test 
statistic (Y, 6), they have simultaneous coverage at least 1 — a. Hence the union 

{d + j + sS-.'ye r Um and s > s 7 } (23) 

of those natural parameter values provides a confidence region for natural pa- 
rameter values showing "how close to infinity" they may be. 

This completes what the original model can say about natural parameters. 
When converted to statements about mean value parameters, these confidence 
intervals say how close the true unknown distribution may be to being concen- 
trated onCnff. 

3.16.3. The Cartesian Product Case 

In the case where the convex support of the saturated model is a Cartesian 
product, all of this simplifies somewhat. We know that a confidence region for 
the mean value parameters of the LCM only involves response variables that are 
not constrained to be at their observed values in the LCM. We take confidence 
intervals, such as those provided by the R function predict .glm applied to the 
limiting conditional model, to be adequate for describing those components of 
the response. 
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Fig 2. Careful confidence intervals for the mean value parameter in the example of Section 2.2. 
Outer points having values zero or one are MLE for the mean value parameter, which are 
also the components of the observed data vector. Inner points are mean values corresponding 
to + 7 + SjS, where s 7 is the lower bound for a 95% one-sided confidence interval for s, 
where (3 = and S is given by (1), and where 7 is chosen differently for each mean value to 
make the interval as large as possible. Compare Figure 1. 



Our one-sided intervals come into play in computing one-sided confidence 
intervals for the mean value parameters of the other components of the response. 
Since the MLE of their mean value parameters are on the boundary, one-sided 
intervals are the only kind that make sense. We distinguish two cases. 

The first case is where rn m = { s6 : s 6 M}. In this case, all intervals (s 7 , 00) 
for different 7 6 Tn m correspond to the same natural parameter values. Hence 
one such interval adequately describes the variability in mean value parameters 
for components of the response having MLE at the boundary. This case contains 
our contingency table example (Section 2.3). 

The second case is where rn m contains vectors other than scalar multiples of 
the GDOR. In this case, intervals (s 7 ,oo) for different 7 e ri; m correspond to 
different natural parameter values. Since they have simultaneous coverage, we 
can use the union (23) to determine these confidence intervals. Figure 2 shows 
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confidence intervals for mean value parameters based on this idea for our logistic 
regression example (Section 2.2). 

Calculation of these intervals theoretically requires knowing s 7 for all 7 G 
Tii m , which entails an infinite amount of work. We have only calculated for a 
large finite set of 7. 

3.16.4- Calculating the Constancy Space 

By Theorems 1 and 4 the constancy space is Nc(y) in the case where every 
direction of recession is a direction of constancy. By Corollary 9 the tangent 
space Tcnniy) in the LCM is spanL. Hence the constancy space is 

Ncnniv) = { S e W : (v,S) = 0, v € span L }. 

In the case of a natural affine submodel this becomes 

Tiim = N CsubnHsub (M T y) = {5 eR q : (v, 5) =0, v E span L sub }. (24) 

Since L S ub is a V-reprcsentation of its span, a call to the function sedd in the 
redd package will compute an H-representation of its span which is also a V- 
representation for (24), that is, a basis for the constancy space of the LCM. 

When calculating (24) for the purpose of generating one-sided confidence 
intervals it is clear that we do not need to let 7 range over the whole constancy 
space (24) because points 7 and 7 + sS lead to the same one-sided confidence 
intervals. Hence it is enough to use the subspace of Tn m orthogonal to 8, which 
is calculated by feeding L su b U {6} to the R function sedd for conversion to an 
H-representation, which will also be a basis of the desired subspace. 

3.17. Multinomial Sampling 

Consider one contingency table and one vector of observed data, but two 
models: Poisson sampling and multinomial sampling. As is well known (1, Sec- 
tion 8.6.7), the maximum likelihood estimates for the mean value parameters 
are the same for both sampling schemes. But much more is the same. 

Suppose we consider the natural statistic to be the vector of cell counts for 
both models, so both have the same natural statistic and natural parameter. 
For Poisson sampling, there are no directions of constancy and the MLE for 
the natural parameter is unique. For multinomial sampling, the vector 7 = 
(1,1,. ..,1) is a direction of constancy, and the MLE for the natural parameter 
is not unique. However, it is easy to see that the unique MLE for the Poisson 
model is also an MLE for the multinomial model. 

Moreover, when we use the same natural statistic and parameter for both 
models, the computational geometry is similar. If subscripts P and M refer to 
the Poisson and multinomial models, respectively, then 

C sa t,M = { y e C sat ,p : (y, 7} = n } 
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where n is the sample size. Also note that 7 is the first column of M, the 
"intercept" column. From this it follows that 

Tc Bnb , M (M T y) = {ve Tc BnbtP (M T y) : (v,e 1 ) =0} 

where e± = (1, 0, . . . , 0). And from this it follows by Theorem 6.42 in (14) that 

Nc sub . M (M T y) D{v + se 1 :ve N Canb , P (M T y), sel}. 

Hence every direction of recession in the Poisson model is also one in the multi- 
nomial model. Similarly every GDOR in the Poisson model is also one in the 
multinomial model. Hence the support of the LCM calculated using such a 
GDOR is also correct. Hence parameter estimates for the LCM are the same for 
both Poisson and multinomial sampling, as are the asymptotic likelihood ratio 
tests described in Section 3.15 above. 

The only thing wc need to change for multinomial sampling is our one-sided 
confidence intervals described in Section 3.16.2 above, because they are based 
on exact probabilities that differ between the two models. The modification is 
obvious: in calculating pxg + +8S (Y G H) we use the multinomial distribution 
rather than the Poisson distribution. Table 2 in the accompanying technical 
report (7) shows the analog of our Table 2 modified using multinomial rather 
than Poisson sampling. 

Product-multinomial sampling is very similar to the situation for multinomial 
sampling. We only note that each sum of components of the response that is 
fixed must be a column of the model matrix if the MLE for Poisson sampling 
are to match that for product-multinomial (1, Section 8.6.7). Details are left as 
an exercise for the reader. 

4. Discussion 

Part of the impetus for writing this article was having to say to a scientist, 
"you are just out of luck, the solution is 'at infinity' and this problem is well 
understood theoretically but software just doesn't handle it — the only thing 
you can do with existing software is fit a smaller model that doesn't have the 
solution 'at infinity' even though this smaller model admittedly (1) does not fit 
the data and (2) does not address the questions of scientific interest." I am glad 
I will never have to say this again. 

Questions of usability and user interface remain. The R function glm and 
generalized linear models software in other statistical computing environments 
(SCE) should just do the right thing when the MLE does not exist in the conven- 
tional sense. This would not only require additional programming — not much 
for R but much more for other SCE that do not have a computational geom- 
etry package — but also would break backward compatibility. The R function 
predict .glm, like other methods of the generic function predict, specifies con- 
fidence intervals by estimates and standard errors (components fit and se.f it 
of the returned object). This clearly will not do for one-sided intervals. 
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The theorems and corollaries of this article are very general, applying to 
all known applications. The theory of hypothesis tests presented (due to S. 
Fienberg) is also general. For confidence intervals and regions, however, open 
research questions remain. 

Aster models (11) allow components of the response vector to be dependent. 
For example, the conditional distribution of Yj given Yk can be Binomial(Yfc, pj). 
When one has yj = yk in the observed data, it can happen that the mean value 
parameter vector with components fij = E(Yj) has \ij = jlk at the MLE, in 
which case a one-sided confidence interval for \ik — /ij is appropriate, but one- 
sided intervals for either /ij or fik do not make sense. More generally, the R func- 
tion predict . aster, unlike other methods of the R generic function predict, 
computes confidence intervals for arbitrary linear functions of the mean value 
parameter vector, which is often useful in applications (the example in 11 and 
all three examples in 15). It is not clear how the one-sided and two-sided con- 
fidence intervals discussed in Sections 3.16.2 and 3.16.1 above combine in this 
setting. 

Even though issues remain, ordinary R users should be able to follow the 
examples in the accompanying technical report (7) to make valid hypothesis 
tests and confidence intervals for GLM and loglinear models for contingency 
tables in cases where the MLE does not exist in the conventional sense and 
previously available software was useless. 
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Appendix A: Proofs 

Proof of Theorem 1. Clearly, ,s i— > 1(9 + sS) fails to be strictly concave if and 
only if s i — * c(9 + sS) fails to be strictly convex, and by Theorem 2.1 in (5) 
this happens if and only if (Y, S) is concentrated at one point, in which case 
this point must be (y,S) so (e) holds. Since all distributions in the family are 
mutually absolutely continuous by (4), (e) implies (f), which trivially implies 
(e). If (f) holds, then by (5) 



Hence (b) holds, and (b) clearly implies (a). We have now proved that (a), (b), 
(e), and (f) are equivalent. 

Also (25) implies (d) by (4), so (f) implies (d). Trivially, (d) implies (c). 
Conversely, if (c) holds, then fo and fe+ s & must be equal almost surely, hence 



c(9 + sS) = c(V0 + logE^(e^ 0+sS -^) 




(25) 



c{6) + s(y,6) 



by (4) 



log/ e+s5 H - log/ fl (o;) = s{Y(lj),S) - c(9 + sS) + c{6) 
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almost surely, hence (Y, S) is constant almost surely, and the constant must be 
(y, S); hence (e) holds. We have now proved that (a) through (f) are equivalent. 

By definition of normal cone and convex support, (e) and (g) are equivalent, 
and (g) and (h) are equivalent by the polarity relationship of normal and tangent 
cones (14, Theorem 6.9 and Corollary 6.30). □ 

Proof of Corollary 2. By Theorem 7.1 and p. 140 in (2), I is concave; thus we 
must have 

l(te 1 -(l-t)§ 2 )>tl(§ 1 ) + (l-t)l(§ 2 ), 0<«<l, (26) 

and since d\ and 2 are MLE, (26) must actually hold with equality. Thus by 
(a) of the theorem B\ — 62 is a direction of constancy. □ 

For the proof of Theorem 3 we use Corollary 2.4.1 in (5), which relies on 
Theorem 2.3 in (5), but the proof of that theorem given in (5) is murky at best. 
So we give a corrected version. 

Corrected Proof of Theorem 2.3 in (5). Equation (2.5) in (5) contains an obvi- 
ous typographical error. It should read 

( 1 V log CO? + log CO) 

(rclogc)(0) = lim 



lim log 



c(9 + s 



s 



z{9)e s,JK ^ 



The rest of the proof of the X(H<p) > case is correct. In the proof of the of the 
A(iJ^) = case, the last displayed formula of the proof is incorrect. Clearly 

e a - aKW F e (A) 1/s -> e a - aK ^\ as s -> 00. 

However, since a < ctr-(</>) was arbitrary, the limit can be made arbitrarily close 
to 1, and we see that 



c(6 + s(j)) 1 1/S 



as s — > 00, 



c(6>)e stT ^) 

as is required for the completion of the proof. □ 

Proof of Theorem 3. The equivalence of (a) and (b) is Theorem 8.6 in (13). The 
equivalence of (a) and (c) is Corollary 2.4.1 in (5). The equivalence of (c) and 
(d) is mutual absolute continuity of the distributions in an exponential family. 
The equivalence of (c) and (e) is immediate from our definition (8) of the normal 
cone. The equivalence of (e) and (f) is the polarity relationship of tangent and 
normal cones (14, Theorem 6.9 and Corollary 6.30). □ 

Proof of Theorem 4- That (a) and (b) are equivalent is Theorem 2.5 in (5). That 
(b) and (c) are equivalent follows from (g) of Theorem 1 and (e) of Theorem 3. 
That (c) and (d) are equivalent is the polarity relationship of tangent and normal 
cones. □ 
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Proof of Corollary 5. By assumption s <— > 1(9 + sS) is a nondecrcasing function. 
Suppose to get a contradiction that 

1(9 + Sl S) = 1(9 + s 2 5) (27) 

for some s± and s 2 such that both sides of (27) are finite and s\ < s 2 . In order 
that / be nondecreasing we must have 

1(9 + siS) = 1(9 + sS), s 1 <s<s 2 

but then 5 is a direction of constancy by Theorem 1 (a). □ 

Proof of Theorem 6. Except for the last sentence, this follows immediately from 
Theorem 2.2 in (5). From (4) 

pr e+sS (Y eH) = E^{l H e< Y ' e+sS -^ e+ ^ +c ^} 

= e *(v>«>-c(e+s<5)+c(fl)_ B ^rj ire <y,e-v>-c(e)+c(v)j 

= e s(v,S)-c(8+s8)+c(8) Wg ( Y eH) 

where Ih denotes the indicator function of the event Y E H. By Corollary 5, 
the function s ^ (y,9 + sS) — c(9 + s6) is strictly increasing, hence so is 
s i — ► We+ss(Y 6 H). That pr e+sS (Y e H) — > 1 as s — > oo follows from Schcffc's 
lemma (see the comments following the theorem). The continuity assertion fol- 
lows from the fact that the moment generating function of the random variable 
(Y,S) is 

E e {e s ^} = E^ Y - e+s5 -^- c ^ +c ^} 

= e c(0+s5)-c(0) 

Hence s i— > c(9 + sS) is actually infinitely differcntiable and so is s t— > ~pr s+sS (Y £ 
H). □ 

Proof of Theorem 7. Suppose L = V. Then con(pos V) is the subspace spanned 
by V, in which case a GDOR does not exist by Theorem 4. 

Suppose L ^= V. Then by the polarity relationship of normal and tangent 
cones for each v £ V\L there exists 5 V € Nc(y) such that (v,S v ) < 0. Hence 
— S v £ Nc(y) and Nc(y) is not a vector subspace. So a GDOR does exist by 
Theorem 4. 

Let 5* = J2 V& 

V \ L S V . Then S* satisfies (15a) and (15b). Observe that 6 e 
Nc(y) if and only if (15a) holds and (15b) holds with < replaced by <. Then 
it is clear that for every 5 £ Nc(y) there exists t > 1 such that tS* + (1 — t)S 
is in Nc(y)- Hence S* G rint Nc(y) by Theorem 6.4 in (13). It now follows 
from Proposition 2.42 in (14) that the set of points satisfying (15a) and (15b) 
is rint Nc(y)- □ 

Proof of Corollary 8. In the proof of the theorem we saw that if a GDOR exists, 
then L ^ V and Nc(y) is not a vector subspace. □ 
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Proof of Corollary 9. Since C is polyhedral convex, every tangent vector is of 
the form s(w — y) for some w E C and s > 0, that is, the closure operation 
in (7) is not necessary. This implies, in particular, that for each v E L there 
exist points w v ,+ and w„ _ in C and positive scalars s c> + and s Vi - such that 
±v = s Vt ±(w Vl ±—y). Observe that these w v> ± are also in CCiH, but no w E V\L 
is in C n H. Thus TcnH(y) = con(posL) = spanL. Since y + span!/ C H, we 
have Cf)HDCn(y + spani). If C D H (f_ C f) (y + spani), then we cannot 
have Tcnii(y) = spani. □ 

Proof of Theorem 10. The polar of a convex cone K is 

K* = { 8 : {w, 6} < 0, w EK} 

(14, Section 6.E). The double polar theorem (14, Corollary 6.2.1) says that 
K** = cl K. When K is closed, in particular when K is polyhedral, then K** = 
K. Here let K = con(pos(V^ u b \ {w})). Then the feasible region for the linear 
program (19) is —K*. Now the optimal value to (19) is nonpositive if and only 
if (w,S) < for all 5 E —K*, which is equivalent by the double polar theorem 
to w E (-K*)* = —K or to —w E K. 

Now w is in (18) if and only if — w is a linear combination of elements of 
V su h with nonnegative coefficients, that is, if — w = a ■ w + J2vev b \{w} av ' v 
where a and all the a v are nonnegative scalars. But this happens if and only if 
— w = J2vev b \{w}( av /(^ + a )) ' u > which is equivalent to — w E K. □ 
Proof of Theorem 11. With probability one 

Y-y= b v (Y)-v 

where all the coefficients b v (Y) are nonnegative. From (20a) and (20b) we can 
derive 

(v,MS) = 0, vEL sat 
(v, MS) < 0, v E V aa t \ 

Hence 

(Y-y,M5)= Y, b v (Y)-(v,M5) (28) 

•u6V aat \L 3 at 

and since all of the (v, MS) in (28) are strictly negative, the sum can only be 
zero if all the b v (Y), v E V sa ± \ L asAi are zero. Thus the support of the limiting 
conditional model consists of points of the form y + X^gL ^ v • v, where the 
coefficients are arbitrary. Since all such points are in the preimage of H sn b under 
the map y >— > M T y, we conclude (21) holds. □ 
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